ReCell Project

Background:

Buying and selling used smartphones used to be something that happened on a handful of online marketplace sites. But the used and refurbished phone market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used smartphones that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing a smartphone. There are plenty of other benefits associated with the used smartphone market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished smartphones. Maximizing the longevity of mobile phones through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost the cheaper refurbished smartphone segment, as consumers cut back on discretionary spending and buy phones only for immediate needs.

Objective:

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished smartphones. I have been hired as a Data Scientist by ReCell, a startup aiming to tap the potential in this market. They want me to analyze the data provided and build a linear regression model to predict the price of a used phone and identify factors that significantly influence it.

Data Dictionary:

The data contains the different attributes of used/refurbished phones. The detailed data dictionary is given below

Importing necessary libraries and data

Load the dataset

Data Overview

Observations

Observations

Data Preprocessing

Renaming the Columns

Fixing the datatypes

Missing Value Treatment

As there are only a few missing values in the numeric columns, we will replace the missing values in each column with that column's median.
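A minimal sketch of the median imputation described above, assuming the data sits in a pandas DataFrame. The toy values are made up for illustration; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names follow the notebook.
df = pd.DataFrame({
    "main_camera_mp": [13.0, np.nan, 8.0, 5.0],
    "ram": [4.0, 3.0, np.nan, 2.0],
    "os": ["Android", "iOS", "Android", "Others"],
})

# Pick out the numeric columns and fill each one with its own median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Confirm that no missing values remain.
print(df.isnull().sum())
```

The median is less sensitive to extreme values than the mean, which makes it a reasonable default for skewed columns.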

Observations

Checking Duplicate Values

No duplicate values were found.

Exploratory Data Analysis (EDA)

Univariate Analysis

Let us explore the numerical variables first

Univariate Analysis of screen_size

Observations

Univariate Analysis of main_camera_mp

Observations

Univariate Analysis of selfie_camera_mp

Observations

Univariate Analysis of int_memory

Observations

Univariate Analysis of ram

Observations

Univariate Analysis of battery

Observations

Univariate Analysis of weight

Observations

Univariate Analysis of release_year

Observations

Univariate Analysis of days_used

Observations

Univariate Analysis of new_price

Observations

Univariate Analysis of used_price

Observations

Let us now explore the categorical variables

Univariate Analysis of brand_name

Observations

Univariate Analysis of os

Observations

Univariate Analysis of 4g

Observations

Univariate Analysis of 5g

Observations

Bivariate Analysis

Plot bivariate charts between numeric variables to understand their interaction with each other.

Correlation

Observations

Bivariate Scatter Plots

Observations

Relationship of ram with brand_name

Observations

Relationship of weight with battery > 4500 mAh

Observations

Relationship of big screen size and brands

Observations

Relationship of phones with selfie camera greater than 8mp and brands

Observations

Relationship of used_price with OS

Observations

Relationship of used_price with 4g and 5g

Observations

Data Preprocessing (contd.)

Column binning

Budget phones make up 38.7% of the data, whereas High End phones make up only 24%.
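A minimal sketch of how such price bins could be built with `pd.cut`. The cut points, the toy prices, the choice of `new_price` as the binned column, and the `device_category` column name are all assumptions for illustration; the thresholds actually used in this notebook are not shown.

```python
import pandas as pd

# Toy prices in euros (hypothetical values).
df = pd.DataFrame({"new_price": [90.0, 150.0, 240.0, 310.0, 520.0, 999.0]})

# Hypothetical cut points; intervals are closed on the right by default.
bins = [-float("inf"), 200, 350, float("inf")]
labels = ["Budget", "Mid Range", "High End"]
df["device_category"] = pd.cut(df["new_price"], bins=bins, labels=labels)

# Share of each category as a percentage of all rows.
share = df["device_category"].value_counts(normalize=True).mul(100).round(1)
print(share)
```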

Outlier Detection

Observations

Outlier Treatment

Log transformation

Some features are very skewed and will likely behave better on the log scale. Let's transform new_price and used_price.
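A minimal sketch of the log transformation, assuming both price columns are strictly positive (the natural log is undefined at zero and below). The toy values and the `_log` suffix are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices (hypothetical values).
df = pd.DataFrame({
    "new_price": [100.0, 200.0, 400.0, 1600.0],
    "used_price": [40.0, 90.0, 150.0, 700.0],
})

# Add a log-scaled copy of each price column.
for col in ["new_price", "used_price"]:
    df[col + "_log"] = np.log(df[col])

# Compare skewness before and after the transformation.
print(df.skew())
```

On this toy data the skewness of each column shrinks after the transformation, which is the effect we are after.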

Observations

Linear Model Building

Create Dummy Variables

Let's check the coefficients and intercept of the model.

Let's check the performance of the model using different metrics.
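As an illustration, here is a small helper that computes some commonly used regression metrics with NumPy. The function name `regression_metrics` and the toy numbers are hypothetical, and the exact set of metrics reported in this notebook may differ.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Return RMSE, MAE, R-squared, adjusted R-squared and MAPE (in %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    resid = y_true - y_pred

    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalises the number of predictors.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

    return {
        "RMSE": np.sqrt(np.mean(resid ** 2)),
        "MAE": np.mean(np.abs(resid)),
        "R2": r2,
        "Adj_R2": adj_r2,
        "MAPE": np.mean(np.abs(resid / y_true)) * 100,
    }

# Toy example (hypothetical values) with a single predictor.
metrics = regression_metrics(
    y_true=[4.0, 5.0, 6.0, 7.0, 8.0],
    y_pred=[4.5, 5.0, 5.5, 7.0, 8.0],
    n_predictors=1,
)
print(metrics)
```

Computing the same metrics on both the training and the test set makes it easier to spot overfitting.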

Observations

Linear Regression using statsmodels

Observations

Checking Linear Regression Assumptions

We will be checking the following Linear Regression assumptions:

  1. No Multicollinearity

  2. Linearity of variables

  3. Independence of error terms

  4. Normality of error terms

  5. No Heteroscedasticity

TEST FOR MULTICOLLINEARITY

Observations

Since the VIF scores for all the variables are less than 5, multicollinearity is low and we do not need to remove any variables to treat it.
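For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below implements this textbook definition with NumPy on synthetic data; the helper name `vif` is hypothetical, and this is not necessarily the routine used in this notebook.

```python
import numpy as np

def vif(X):
    """VIF for each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out

# Synthetic predictors: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

scores = vif(np.column_stack([a, b, c]))
print(scores)
```

Here the near-duplicate pair gets very high scores while the independent column stays close to 1, which is the pattern the threshold of 5 is meant to catch.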

TEST FOR LINEARITY AND INDEPENDENCE

Why the test?

How to check linearity and independence?

How to fix if this assumption is not followed?

Observations

TEST FOR NORMALITY

Why the test?

How to check normality?

How to fix if this assumption is not followed?

TEST FOR HOMOSCEDASTICITY

Why the test?

How to check for homoscedasticity?

How to fix if this assumption is not followed?

Observations

Since the p-value is < 0.05, we conclude that the residuals are heteroscedastic. We will still continue with this model, as it works well for Budget and Mid Range phones. The heteroscedasticity could be due to outliers among High End phone prices. The limited number of observations also restricts our ability to test this further.
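The test that produced this p-value is not shown here. As one common choice, the Goldfeld-Quandt test fits the regression separately on the first and second half of the data and compares the residual variances with an F-test; the null hypothesis is that the residuals are homoscedastic. The sketch below is a simplified version written with NumPy and SciPy on synthetic data, and the helper name `goldfeld_quandt` is hypothetical.

```python
import numpy as np
from scipy import stats

def goldfeld_quandt(y, X):
    """Return (F statistic, p-value) for H0: equal error variance in both halves.

    X must already contain an intercept column; rows are assumed to be
    ordered so that the variance, if it changes, increases down the sample.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    split = n // 2

    def rss(y_part, X_part):
        # Residual sum of squares of an ordinary least squares fit.
        beta, *_ = np.linalg.lstsq(X_part, y_part, rcond=None)
        resid = y_part - X_part @ beta
        return resid @ resid

    df1, df2 = split - k, (n - split) - k
    f_stat = (rss(y[split:], X[split:]) / df2) / (rss(y[:split], X[:split]) / df1)
    p_value = stats.f.sf(f_stat, df2, df1)  # one-sided: variance increasing
    return f_stat, p_value

# Synthetic data whose noise grows with x, so H0 should be rejected.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(1, 10, size=400))
X = np.column_stack([np.ones_like(x), x])
y = 2 + 0.5 * x + rng.normal(scale=0.2 * x)

f_stat, p_value = goldfeld_quandt(y, X)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4g}")
```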

Now that we have checked all the assumptions of linear regression, and found them satisfied apart from the heteroscedasticity noted above, we can move on to the prediction part.

Model performance evaluation

Note: As the number of records is large, we show a sample of only 25 records for illustration.

Let's compare the initial model created with sklearn and the final statsmodels model.

Let's recreate the final statsmodels model and print its summary to gain insights.

Final Model Summary

Conclusions

  1. new_price has a significant positive relationship with used_price. As new_price increases, used_price sqrt increases by 0.60 euros, as indicated by the positive coefficient.
  2. As screen_size, main_camera_mp, selfie_camera_mp and int_memory increase, used_price increases, but only by a small amount.
  3. As weight, release_year and days_used increase, used_price decreases, as indicated by the negative coefficients.
  4. An increase in release_year also significantly decreases used_price sqrt, by ~0.25 euros.
  5. Phones running iOS have a significantly higher used_price sqrt, by ~0.23 euros, compared with other OS. Phones with OS listed as "Others" have a significantly lower used_price sqrt, by 0.15 euros, compared with other OS.
  6. Phones with 4g have a used_price that is 0.0868 euros lower than phones without 4g. Phones with 5g have a used_price that is 0.41 euros lower than phones without 5g.
  7. Mid Range and High End phones have a used_price that is 0.033 and 0.057 euros higher, respectively, than Budget phones.

Recommendations

Let's have a quick final overview of the data using the pandas profiling library.